coin flip
Exploring the limits of strong membership inference attacks on large language models
State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training references (e.g., fine-tuning attacks), or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle, and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA, one of the strongest MIAs, to GPT-2 architectures ranging from 10M to 1B parameters, training references on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways. While (1) strong MIAs can succeed on pretrained LLMs, (2) their effectiveness remains limited (e.g., AUC<0.7) in practical settings.
Appendix
In this section we motivate the design choices and inductive biases that we encode into our neural encoder network e, which is the network that is used to model the relative accuracies of the weak supervision sources λ. Recall that we model the probability of a particular sample x ∈ X having the class label y ∈ Y = {1, ..., C} as P_θ(y | λ) = softmax(s)_y P(y), (4) where s = θ(λ, x)^T λ ∈ ℝ^C. Connection to prior PGM models: We now motivate this choice by deriving a less expressive variant of it from the standard Markov Random Field (MRF) used in the related work. If we view the attention scores θ(λ, x) ∈ ℝ^m, which assign sample-dependent accuracies to each labeling function, as sample-independent parameters θ_1 and thereby drop the features from the equation, as is done in the related work [30, 32, 19, 11], we can rewrite Eq. 4 as exp(θ_1^T 1{λ = y}) P ... We can recognize P_θ as a distribution from the exponential family, and more specifically as a pairwise MRF, or factor graph, with canonical parameters θ = (θ_1, θ_2) and corresponding sufficient statistics, or factors, φ(λ, y) = (φ_1(λ, y), φ_2(λ)), as well as the log partition function Z_θ. The accuracy factors and parameters φ_1, θ_1 are the core component of this model and sometimes take the form φ_1(λ, y) = λy in binary models as in [30, 19, 11]. The label-independent factors φ_2(λ) have, as can be seen from the derivation above, no direct influence on the latent label posterior, but are often used to model labeling propensities 1{λ ≠ 0} and correlation dependencies 1{λ_i = λ_j}, which can be important for PGM parameter learning, but are susceptible to misspecifications [39, 11, 8].
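To make Eq. 4 concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the labeling-function outputs are one-hot encoded per class (so that s_c sums the accuracy scores of the sources voting for class c), that abstaining sources contribute nothing, and that the prior-weighted softmax is renormalized over classes. The function name `label_posterior` and its arguments are hypothetical, and the accuracy scores are passed in as plain numbers rather than produced by an encoder network.

```python
import math


def label_posterior(theta, votes, num_classes, class_prior):
    """Sketch of P(y | votes) proportional to softmax(s)_y * P(y).

    theta: one accuracy score per labeling function (assumed given).
    votes: one class index per labeling function, or None for an abstain.
    """
    # s[c] = sum of the scores of all labeling functions that voted for class c.
    scores = [0.0] * num_classes
    for accuracy, vote in zip(theta, votes):
        if vote is not None:
            scores[vote] += accuracy

    # Numerically stable softmax over the class scores.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    soft = [e / total for e in exp_scores]

    # Weight by the class prior and renormalize (an assumption of this sketch).
    weighted = [p * q for p, q in zip(soft, class_prior)]
    norm = sum(weighted)
    return [w / norm for w in weighted]


if __name__ == "__main__":
    # Two sources vote for class 0, one lower-scored source votes for class 1.
    print(label_posterior([1.0, 1.0, 0.5], [0, 0, 1], 2, [0.5, 0.5]))
```

With sample-independent scores, this reduces to the less expressive variant discussed above, where each source carries one fixed accuracy parameter.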
Appendix A Posterior Reparameterization
In this section we motivate the design choices and inductive biases that we encode into our neural encoder network e, which is the network that is used to model the relative accuracies of the weak supervision sources λ. Recall that we model the probability of a particular sample x ∈ X having the class label y ∈ Y = {1, ..., C} as P ... Our own parameterization therefore is a more expressive variant of these latent-variable PGM models, where we are able to assign LF accuracies on a sample-by-sample basis. Furthermore, our neural encoder network outputs them as a function of the LF outputs and features, and is expected to learn the easy-to-misspecify dependencies and label-independent statistics implicitly. The top 2 performance scores are highlighted as First, Second. Triplet-median [11] is not listed as it only converged for IMDB with 12 LFs (F1 = 73.0
A Approximate Behavior of Metrics on Sequential Data
How do different metrics behave when used to measure autoregressive model outputs? A.1 Per-Token Error Probability is Resolution-Limited. Here, resolution refers to "the smallest interval measurable ... After F coin flips, we can only resolve the coin's probability of ... A.3), we ignore how likely the language model is to over- ... Section 3.2 of [23] gives the exact definition, but the ... Simulations show that as the per-token error probability increases slightly (e.g., from 0.05 to 0.1), the ROUGE-L-Sum metric falls sharply. Figure 10: Induced emergent MNIST classification ability in convolutional networks.
Fake News in Social Networks
Aymanns, Christoph, Foerster, Jakob, Georg, Co-Pierre, Weber, Matthias
We propose multi-agent reinforcement learning as a new method for modeling fake news in social networks. This method allows us to model human behavior in social networks both in unaccustomed populations and in populations that have adapted to the presence of fake news. In particular, the latter is challenging for existing methods. We find that a fake-news attack is more effective if it targets highly connected people and people with weaker private information. Attacks are more effective when the disinformation is spread across several agents than when the disinformation is concentrated with more intensity on fewer agents. Furthermore, fake news spreads less well in balanced networks than in clustered networks. We test a part of our findings in a human-subject experiment. The experimental evidence provides support for the predictions from the model, suggesting that the model is suitable to analyze the spread of fake news in social networks.
How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation
Yang, Hao, Zhao, Qinghua, Li, Lei
Chain-of-Thought (CoT) prompting significantly enhances model reasoning, yet its internal mechanisms remain poorly understood. We analyze CoT's operational principles by reversely tracing information flow across decoding, projection, and activation phases. Our quantitative analysis suggests that CoT may serve as a decoding space pruner, leveraging answer templates to guide output generation, with higher template adherence strongly correlating with improved performance. Furthermore, we surprisingly find that CoT modulates neuron engagement in a task-dependent manner: reducing neuron activation in open-domain tasks, yet increasing it in closed-domain scenarios. These findings offer a novel mechanistic interpretability framework and critical insights for enabling targeted CoT interventions to design more efficient and robust prompts. We released our code and data at https://anonymous.4open.science/r/cot-D247.
Here's how to generate a truly random number with quantum physics
Very little in this life is truly random. A coin flip is influenced by the flipper's force, its surrounding airflow, and gravity. Similar variables dictate rolling a pair of dice or shuffling a deck of cards, while even classical computing's cryptographic algorithms are theoretically susceptible to outside influence or bias. "True randomness is something that nothing in the universe can predict in advance," explained Krister Shalm, a physicist at the National Institute of Standards and Technology (NIST).
Enough Coin Flips Can Make LLMs Act Bayesian
Gupta, Ritwik, Corona, Rodolfo, Ge, Jiaxin, Wang, Eric, Klein, Dan, Darrell, Trevor, Chan, David M.
Large language models (LLMs) exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs utilize ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching. Using a controlled setting of biased coin flips, we find that: (1) LLMs often possess biased priors, causing initial divergence in zero-shot settings, (2) in-context evidence outweighs explicit bias instructions, (3) LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors rather than flawed updates, and (4) attention magnitude has negligible effect on Bayesian inference. With sufficient demonstrations of biased coin flips via ICL, LLMs update their priors in a Bayesian manner.
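As a reference point for what a Bayesian posterior update on biased coin flips looks like, here is a minimal Python sketch of the standard Beta-Bernoulli conjugate update. It is an illustration only and not the paper's evaluation code. The function names, the uniform Beta(1, 1) prior, and the example flip sequence are assumptions chosen for the example.

```python
from fractions import Fraction


def beta_bernoulli_update(alpha, beta, flips):
    """Return the posterior Beta parameters after observing coin flips.

    flips: iterable of 1 (heads) and 0 (tails).
    A Beta(alpha, beta) prior over the heads probability is assumed.
    """
    flips = list(flips)
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution: alpha / (alpha + beta)."""
    return Fraction(alpha, alpha + beta)


if __name__ == "__main__":
    # Hypothetical in-context evidence: 6 heads and 2 tails.
    observed = [1, 1, 0, 1, 1, 1, 0, 1]
    a, b = beta_bernoulli_update(1, 1, observed)  # uniform prior, assumed
    print(a, b, posterior_mean(a, b))  # 7 3 7/10
```

A miscalibrated prior corresponds to starting from different values of alpha and beta. The update rule is unchanged, and the prior's influence shrinks as more flips are observed.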
Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation
Yang, Hao, Zhao, Qianghua, Li, Lei
Chain-of-Thought prompting has significantly enhanced the reasoning capabilities of large language models, with numerous studies exploring factors influencing its performance. However, the underlying mechanisms remain poorly understood. To further demystify the operational principles, this work examines three key aspects: decoding, projection, and activation, aiming to elucidate the changes that occur within models when employing Chain-of-Thought. Our findings reveal that LLMs effectively imitate exemplar formats while integrating them with their understanding of the question, exhibiting fluctuations in token logits during generation but ultimately producing a more concentrated logits distribution, and activating a broader set of neurons in the final layers, indicating more extensive knowledge retrieval compared to standard prompts. Our code and data will be publicly available when the paper is accepted.